Method and apparatus for dynamic beam control in Viterbi search

ABSTRACT

A method is presented including selecting an initial beam width. The method also includes determining whether a value per frame is changing. A beam width is dynamically adjusted. The method further decodes a speech input with the dynamically adjusted beam width. Also, a device is presented including a processor (420). A speech recognition component (610) is connected to the processor (420). A memory (410) is connected to the processor (420). The speech recognition component (610) dynamically adjusts a beam width to decode a speech input.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to speech recognition, and more particularly to a method and apparatus for dynamic beam control in a Viterbi search.

2. Description of the Related Art

Speech or voice recognition has become very popular as a way to increase work efficiency. Several techniques are used in speech recognition processes to recognize the human voice. Speech recognition also functions as a pipeline that converts digital audio signals coming from devices, such as a personal computer (PC) sound card, into recognized speech. These signals may pass through several stages, where various mathematical and statistical processes are used to determine what has actually been said.

Many speech recognition applications have databases containing thousands of frequencies or “phonemes” (also known as “phones” in speech recognition systems). A phoneme is the smallest unit of speech in a language or dialect (i.e., the smallest unit of sound that can distinguish two words in a language). The utterance of one phoneme is different from another. Therefore, if one phoneme replaces another in a word, the word would have a different meaning. For example, if the “B” in “bat” were replaced by the phoneme “R,” the meaning would change to “rat.” The phoneme databases are used to match the audio frequency bands that were sampled. For example, if an incoming frequency sounds like a “T,” an application will try to match it to the corresponding phoneme in the database. Also, adjacent phones, known as context, can affect pronunciation. For example, the “T” in “that” sounds different from the “T” in “truck.” A phone with a fixed left (right) context is generally known as a “left (right) biphone.” A phone with fixed left and right contexts is known as a “triphone.” The phoneme databases may contain many entries for each phoneme corresponding to bi- or triphones. Each phoneme is tagged with a feature number, which is then assigned to the incoming signal.

There can be so many variations in sound due to how words are spoken that it is almost impossible to exactly match an incoming sound to an entry in the database. Moreover, different people may pronounce the same word differently. Further, the environment also adds its own share of noise. Thus, applications must use complex techniques to approximate an incoming sound and figure out which phonemes are being used.

Another problem in speech recognition involves determining when a phoneme (or smaller unit) ends and the next one begins. For problems like this, a technique called the hidden Markov model (HMM) may be implemented. An HMM provides a pattern matching approach to speech recognition.

An HMM is generally defined by the following elements: first, the number of states in the model, N; next, a state-transition matrix A, where $a_{ij}$ is the probability of the process moving from state $q_i$ to state $q_j$ at time t = 1, 2, . . . , given that the process is at state $q_i$ at time t−1; the observation probability distribution $b_i(\vec{o})$ for all states $q_i$, i = 1, . . . , N; and the initial state probability $\pi_i$ for i = 1, . . . , N.

In order to perform speech recognition using an HMM, languages are typically broken down into a limited group of phonemes. For example, the English language may be broken down into approximately 40-50 phonemes. A stochastic model of each of the units (i.e., phones) is then created. Given an acoustical observation, the most likely phoneme corresponding to the observation can then be determined. One should note, however, that if context units are used, such as bi-phones or tri-phones, the limited group may consist of several thousands of units. Therefore, a stochastic model for each of the units would be created. A method for determining the most likely phoneme corresponding to the acoustical observation uses Viterbi (named after A. J. Viterbi) scoring.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one.

FIG. 1 illustrates four stages of a typical Viterbi search algorithm.

FIG. 2 illustrates a block diagram of an embodiment of the invention having dynamic beam control.

FIG. 3 illustrates a block diagram of an embodiment of the invention having dynamic beam control (when there are enough active paths) or N-best decoding (if there are too few or too many active paths).

FIG. 4 illustrates a typical system that can be used for speech recognition tasks.

FIG. 5 illustrates an embodiment of the invention having a dynamic beam control process included in a system.

FIG. 6 illustrates an embodiment of the invention having a dynamic beam control circuit.

DETAILED DESCRIPTION OF THE INVENTION

The invention generally relates to a method for dynamic beam control in a Viterbi search. Referring to the figures, exemplary embodiments of the invention will now be described. The exemplary embodiments are provided to illustrate the invention and should not be construed as limiting the scope of the invention.

FIG. 1 illustrates four stages of a typical Viterbi search algorithm. This Viterbi method compares the acoustical observation against a stochastic model of a phoneme in order to determine a probability that this observation corresponds to that phoneme. Viterbi scoring determines the single best state sequence, i.e., the state transition path that yields the highest probability of the observation matching the model. This same determination is performed for each of the 40-50 phonemes (or other smaller units, such as tri-phones) of the English language. In this manner, the phoneme with the highest probability of matching the acoustical observation is determined. If the acoustical observation includes more than one phoneme, such as a spoken word, then the above technique can be repeated to determine a set of the most likely phonemes corresponding to the acoustical observation. In a speech recognition system, after a model is completed, for example during the decoding of the word “ask,” the system would choose all possible continuations, concatenate the HMMs corresponding to the continuations, and decoding would continue. For example, after decoding the second tri-phone “AE−S+K,” the speech recognition system could choose tri-phones “S−K+*” (where * represents any other phone). Then, after decoding a portion of the input data (e.g., corresponding to a sentence or several words), the speech recognition system chooses the best hypotheses as a result.

The Viterbi algorithm is commonly used in HMM speech recognition. The Viterbi search algorithm is typically used to find the best word sequence that matches the speech to be recognized. Here, the matching is in terms of statistical likelihood values, and the search is through all possible word sequences. For large vocabulary systems, the search space increases dramatically. To be precise, the search space increases as N^(L), where N is the vocabulary size and L is the length of the hypothesized sentence.
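As a purely illustrative calculation, the growth of the search space may be sketched in Python as follows. The function name, the vocabulary size and the sentence length are hypothetical values chosen for the example and are not taken from the embodiments.

```python
def search_space_size(vocabulary_size: int, sentence_length: int) -> int:
    """Number of distinct word sequences of a given length: N ** L."""
    return vocabulary_size ** sentence_length


# Hypothetical figures: a 20,000-word vocabulary and a 10-word sentence.
size = search_space_size(20_000, 10)
print(f"{size:.3e} candidate word sequences")
```

Even for this modest sentence length the count is far beyond what can be enumerated, which is why unlikely paths have to be pruned during the search.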

To avoid search space explosion, beam control is a technique used in the search algorithm to prune out unlikely hypotheses or search paths. A beam search requires the computation of a reference score referred to as the least upper bound (LUB) on the log probability of the most likely hypothesis. The word “score” refers to a negative logarithmic probability; high scores typically mean low probabilities and low scores typically mean high probabilities. The use of beam control in continuous speech recognition (CSR) systems makes use of a static value, the beam width, to control the likelihood range (relative to the best likelihood value) that all search paths may have at a certain time.
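By way of illustration only, pruning with a static beam width may be sketched in Python as follows. The sketch assumes that scores are negative log probabilities, so that a lower score is better; the function name and the sample hypotheses are hypothetical.

```python
def prune_static_beam(scores: dict, beam_width: float) -> dict:
    """Keep hypotheses whose score lies within beam_width of the best score.

    Scores are assumed to be negative log probabilities (lower is better).
    """
    if not scores:
        return {}
    best = min(scores.values())
    return {hyp: s for hyp, s in scores.items() if s <= best + beam_width}


# Hypothetical scores for three competing hypotheses.
hypotheses = {"a": 10.0, "b": 12.5, "c": 31.0}
kept = prune_static_beam(hypotheses, beam_width=5.0)
print(sorted(kept))
```

With a beam width of 5.0, hypothesis "c" lies more than 5.0 above the best score and is discarded, while "a" and "b" continue.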

The four stages comprising the conventional Viterbi search algorithm are pre-processing stage 110, initialization stage 120, recursion stage 130, and termination stage 140. One should note that, in speech recognition systems, pre-processing stage 110 may not be completely performed before the other stages execute. There may be several reasons that pre-processing stage 110 may not complete before the other stages, such as: not all of the speech models that are taken into account during recognition are known before recognition begins, due to concatenation of hidden Markov models (HMMs) dependent on intermediate recognition results; the impossibility of listing all word sequences for continuous speech recognition; or, for on-line decoding, input observation data may not be accessible. For every speech model φ_(m) (where φ_(m) may be all models for all possible utterances, the utterance HMM, or concatenated HMMs), where m=1, . . . , M, the four stages are performed as follows:

Pre-processing 110:

$\tilde{\pi}_i = \log(\pi_i), \quad 1 \le i \le N$

$\tilde{b}_i(\vec{o}_t) = \log(b_i(\vec{o}_t)), \quad 1 \le i \le N, \; 1 \le t \le T$

$\tilde{a}_{ij} = \log(a_{ij}), \quad 1 \le i, j \le N$

Initialization 120:

$\tilde{\delta}_1(j) = \tilde{\pi}_j + \tilde{b}_j(\vec{o}_1), \quad 1 \le j \le N$

Recursion 130:

$\tilde{\delta}_t(j) = \max_{1 \le i \le N} \left[ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij} \right] + \tilde{b}_j(\vec{o}_t), \quad 2 \le t \le T, \; 1 \le j \le N$

Termination 140:

$\tilde{P}_m^* = \max_{1 \le i \le N} \left[ \tilde{\delta}_T(i) \right]$

where the score $\tilde{\delta}_t(j)$ is an approximation of the logarithm of the probability for the most probable path passing node j at the time t and $\tilde{P}_m^*$ is the logarithm of the probability for the most probable path ending at node N at time T. The resulting recognition (i.e., the word to which the unknown speech signal corresponds) is

$\hat{\lambda} = \lambda_m$

where

$m = \arg\max_{1 \le m \le M} \tilde{P}_m^*.$
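The four stages above may be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of any embodiment: the function names are hypothetical, the observation probabilities b_i(o_t) are assumed to be supplied already evaluated as a table indexed by frame and state, and no beam pruning is applied.

```python
import math

NEG_INF = float("-inf")


def safe_log(p: float) -> float:
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_score(pi, a, b):
    """Best-path log score for one model.

    pi[i]   : initial probability of state i
    a[i][j] : transition probability from state i to state j
    b[t][j] : probability of the frame-t observation in state j
              (assumed precomputed for this sketch)
    """
    n_states, n_frames = len(pi), len(b)
    # Pre-processing: move every probability into the log domain.
    log_pi = [safe_log(p) for p in pi]
    log_a = [[safe_log(p) for p in row] for row in a]
    log_b = [[safe_log(p) for p in row] for row in b]
    # Initialization: scores at the first frame.
    delta = [log_pi[j] + log_b[0][j] for j in range(n_states)]
    # Recursion: best predecessor for every state at every later frame.
    for t in range(1, n_frames):
        delta = [
            max(delta[i] + log_a[i][j] for i in range(n_states)) + log_b[t][j]
            for j in range(n_states)
        ]
    # Termination: best score over all final states.
    return max(delta)


def recognize(models, b_tables):
    """Index of the model with the highest best-path log score."""
    scores = [viterbi_log_score(pi, a, b) for (pi, a), b in zip(models, b_tables)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Working in the log domain turns the products of probabilities into sums, which avoids numerical underflow over long observation sequences.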

In pre-processing stage 110, logarithmic values of the initial state probability $\pi_i$ for i = 1, . . . , N, the feature probability distribution $b_i(\vec{o}_t)$, where 1 ≤ i ≤ N and 1 ≤ t ≤ T, and the state transition probabilities $a_{ij}$, where 1 ≤ i, j ≤ N, are computed and stored in memory. The function $b_j(\vec{o})$ and the values of the initial state probability $\pi_i$ and the state transition probabilities $a_{ij}$ generally depend upon the particular speech model φ_(m) being considered. In order to decrease the amount of data described in the models, however, some constants are set equal regardless of the model. For example, initial state probabilities may be set to $\pi_1 = 1$ and $\pi_i = 0$ when i > 1 for all speech models. The values that are determined during the pre-processing stage are sometimes computed and stored once.

In initialization stage 120, the path scores $\tilde{\delta}_1(j)$ are calculated at time 1 for state j = 1, . . . , N.

In recursion stage 130, the scores $\tilde{\delta}_t(j)$ are calculated at time t, where 2 ≤ t ≤ T, for state j, where 1 ≤ j ≤ N, by maximizing over the predecessor states i, ranging from 1 to N. During termination stage 140, the highest probability result (or best path score) for each specific model is determined from the calculations obtained in recursion stage 130. An overall best path score is obtained by comparing the best path scores obtained for each model.

In one embodiment of the invention, a beam control mechanism dynamically adjusts the beam width according to clues during recognition to improve performance. At the beginning of speech recognition, a wider beam width should be used due to the uncertainty of what was spoken, while a narrower beam can be used as more clues are obtained. In other words, a constant beam width may work well for one part of a speech utterance, but may not work well for the middle of the utterance or may be too large for other parts of speech utterances, which results in processing many useless hypotheses. Therefore, according to one embodiment of the invention, a beam control method dynamically adjusts the beam width according to clues learned during recognition.

One embodiment of the invention will be described as follows. Let Φ_(t) denote the set of active paths at time t. Let N_(t) be defined as the number of paths in Φ_(t). Let p(φ) be defined as the likelihood value of any path φ∈Φ_(t). Let α_(t) be defined as the best likelihood value within Φ_(t), with α₀=0. Then,

$\alpha_t = \max_{\phi \in \Phi_t} p(\phi)$

Also let β be defined as the beam width.

In one embodiment of the invention, pruning Φ_(t) with the beam width β means discarding those paths that have p(φ)<(α_(t)−β). In one embodiment of the invention, the beam width β_(t) at time t is chosen proportionally to some initial beam width B as β_(t)=b_(t)×B, where b_(t) is defined as follows:

$b_t = \begin{cases} b_1 & \text{if } b'_t < b_1 \\ b_2 & \text{if } b'_t > b_2 \\ b'_t & \text{otherwise} \end{cases}$

where $b'_0 = b'_1 = B$,

$b'_t = \frac{\alpha_{t-1} - \alpha_{t-2}}{\alpha_{t-1}} \times t, \quad t > 2$

and [b₁, b₂] is the range of coefficients, which are determined heuristically. For example, the initial values for α_(t) and b_(t) can be α₁=0 and b₀=1. Then, the beam width, β_(t), is dynamically adjusted to reduce processing time while keeping word error rate (WER) low.
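The coefficient computation described above may be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the default range [0.5, 1.05] is the one reported for the examples later in this description, and the handling of a zero value of α_(t−1) (falling back to the widest coefficient) is an assumption made so that the sketch never divides by zero.

```python
def beam_coefficient(alpha_prev: float, alpha_prev2: float, t: int,
                     b1: float, b2: float) -> float:
    """Coefficient b_t: the raw ratio clipped to the range [b1, b2].

    alpha_prev  : best likelihood at frame t-1
    alpha_prev2 : best likelihood at frame t-2
    """
    if alpha_prev == 0.0:
        # Assumption: the ratio is undefined here, so use the widest beam.
        return b2
    raw = (alpha_prev - alpha_prev2) / alpha_prev * t
    return min(max(raw, b1), b2)


def dynamic_beam_width(alpha_prev: float, alpha_prev2: float, t: int,
                       initial_beam: float,
                       b1: float = 0.5, b2: float = 1.05) -> float:
    """Beam width beta_t = b_t * B for the current frame."""
    return beam_coefficient(alpha_prev, alpha_prev2, t, b1, b2) * initial_beam


# Hypothetical cumulative log-likelihoods at frames 9 and 8, for frame t = 10.
print(dynamic_beam_width(-100.0, -90.0, 10, initial_beam=140.0))
```

When the latest frame changes the best likelihood by about its running per-frame average, the raw ratio is close to 1 and the beam stays near its initial width; smaller or larger changes narrow or widen the beam within the clipping range.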

FIG. 2 illustrates a block diagram of an embodiment of the invention having dynamic beam width process 200. After initially testing a speech recognition system, an optimal parameter set is selected based on optimal decoding speed and WER. The decoding speed and WER are determined based on the number of active hypotheses. For instance, the more active hypotheses there are, the slower the decoding rate and the fewer the errors.

In a beam search, all hypotheses that are worse than the best hypothesis by more than the chosen beam width are pruned. In block 210, process 200 initializes the beam width to a predetermined value. The predetermined value is based on the optimal parameter set. Block 220 retrieves the next speech frame (observation). Block 230 determines the best likelihood of the best hypotheses. Block 240 determines the value of the beam width used to select hypotheses for the current speech frame. If block 240 determines that the likelihood value per frame is increasing (the score grows more slowly), process 200 decreases the beam width from the pre-selected initial beam width to raise the decoding speed.

Block 240 also determines if the likelihood value per frame is decreasing (the score is growing faster). If block 240 determines that the likelihood value per frame is decreasing, then block 240 increases the beam width. In one embodiment of the invention, block 240 decreases/increases the beam width by a user selected increment. In another embodiment of the invention, the speech recognition system automatically decreases/increases the beam width by a small increment based on a chosen percentage, such as 10%. It should be noted that other methods of decreasing or increasing the beam width may be implemented without deviating from the various embodiments of the invention.
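The percentage-based adjustment performed by block 240 may be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the percentage is applied to the current beam width and that no lower or upper bound is enforced; the description leaves both points open.

```python
def adjust_beam(beam: float, prev_per_frame: float, curr_per_frame: float,
                step: float = 0.10) -> float:
    """Narrow the beam when the per-frame likelihood rises, widen it when it falls.

    step is the fractional increment (0.10 corresponds to 10%).
    """
    if curr_per_frame > prev_per_frame:
        return beam * (1.0 - step)   # likelihood increasing: decode faster
    if curr_per_frame < prev_per_frame:
        return beam * (1.0 + step)   # likelihood decreasing: search wider
    return beam                      # unchanged: keep the current beam


# Hypothetical per-frame log-likelihood values.
print(adjust_beam(100.0, prev_per_frame=-5.0, curr_per_frame=-4.0))
```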

Process 200 continues with block 250. Block 250 propagates all active paths with the new beam width (dynamically modified). Block 260 determines if the speech utterance decoding is complete. If block 260 determines that decoding is not complete, process 200 continues with block 220. If block 260 determines that the decoding is completed, process 200 continues with block 270. Block 270 determines that the best path is the speech recognition result.

With process 200 increasing, decreasing, or maintaining the current beam width while decoding speech input, WER is not compromised while increasing the rate at which decoding completes (i.e., decreasing decoding time).

Table I illustrates results from an example using an embodiment of the invention having a dynamic beam control process for a Chinese language speech recognition task. One should note that other languages may also be input for recognition purposes. In Table I, it can be seen that by using an embodiment of the invention having dynamic beam control on the Chinese language recognition task, the embodiment of the invention improved speed by sixty percent (60%), i.e., the real time rate went from 3.14 to 1.24. One should note that the real time rate is the central processing unit (CPU) time required for decoding completion divided by the duration of speech. In other words, if the real time rate is less than 1, then on-line decoding (decoding at the speed of a person who is speaking) is possible.

TABLE I
Dynamic Beam Control for Chinese Task

                   Static Beam   Dynamic Beam
Real-time Rate     3.14          1.24
Word Error Rate    8.4           8.3
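The arithmetic behind these figures may be sketched in Python as follows. The function names are hypothetical; the two real time rates are the ones reported for the Chinese language task.

```python
def real_time_rate(cpu_seconds: float, speech_seconds: float) -> float:
    """CPU time needed for decoding divided by the duration of the speech."""
    return cpu_seconds / speech_seconds


def relative_speedup(rate_before: float, rate_after: float) -> float:
    """Fraction of decoding time saved when the rate drops."""
    return (rate_before - rate_after) / rate_before


# Real time rates reported for the static and dynamic beam (Chinese task).
saving = relative_speedup(3.14, 1.24)
print(f"{saving:.1%} of decoding time saved")
print("on-line decoding possible:", 1.24 < 1.0)
```

The saving works out to roughly the sixty percent stated above, while a rate of 1.24 is still above 1, so this configuration does not yet reach on-line decoding speed.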

Table II illustrates results from an example of using an embodiment of the invention having dynamic beam control for an English language speech recognition task. In Table II, the embodiment of the invention improved speed on the English language task by forty-five percent (45%) (real time rate from 3.4 to 1.85). In the Chinese language task and the English language task, no significant increase in WER is observed. Note that the results illustrated in Table I and in Table II were achieved using a 550 MHz Intel Pentium® processor machine having a cache memory of 512 K and a 512 megabyte synchronous dynamic random access memory (SDRAM). One should note that other systems having different processing speeds and memory may also be used with embodiments of the invention.

For the Chinese language and English language task examples, the initial beam value was set to 140 and 180, respectively. The same coefficient range [b₁, b₂] was used and set to [0.5, 1.05].

TABLE II
Dynamic Beam Control for English Task

                   Static Beam   Dynamic Beam
Real-time Rate     3.4           1.85
Word Error Rate    11            11.4

In one embodiment of the invention, a dynamic beam control is used in a Viterbi search having the beam width adjusted only if there exists a normal number of active paths (not too many or too few). For this embodiment, let β^(t) _(N) denote a beam width for which there will be exactly N active paths left in Φ_(t) after pruning. The beam width β_(t) at time t is chosen as follows:

$\beta_t = \begin{cases} \infty, & N_t < 2N_1 \\ \beta^t_{N_2}, & 2N_1 \le N_t < 2N_2 \\ b_t \times B, & 2N_2 \le N_t \le 2N_3 \\ \beta^t_{N_3}, & 2N_3 < N_t \end{cases} \qquad (2)$

where 0<N₁<N₂<N₃ are pre-defined thresholds and B is the initial beam width. The value b_(t) is estimated from the following:

$b_t = \begin{cases} b_1 & \text{if } b'_t < b_1 \\ b_2 & \text{if } b'_t > b_2 \\ b'_t & \text{otherwise} \end{cases}$

In this embodiment of the invention, the beam width is adjusted only if there exists a normal number of active paths (not too many or too few), namely in the range of [2N₂, 2N₃]. In this embodiment of the invention, if the number of active paths falls out of the range [2N₂, 2N₃], then pruning is performed to ensure the number of active paths falls back into the range. If the total number of active paths is less than the threshold 2N₁, no pruning is performed (i.e., the beam width is infinite).
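The selection among these cases may be sketched in Python as follows. The sketch is illustrative only and rests on stated assumptions: the function names are hypothetical, path likelihoods are log values for which higher is better, the beam width that leaves N paths is taken as the gap between the best and the N-th best likelihood (so ties at the cut-off may leave more than N paths), and when fewer than N paths are active all of them are kept.

```python
import math


def n_best_beam(ranked, n: int) -> float:
    """Beam width that keeps the n best entries of a descending-sorted list."""
    k = min(n, len(ranked))
    return ranked[0] - ranked[k - 1]


def choose_beam(likelihoods, n1: int, n2: int, n3: int,
                dynamic_beam: float) -> float:
    """Beam width for the current frame, by number of active paths."""
    n_t = len(likelihoods)
    if n_t < 2 * n1:
        return math.inf                  # very few paths: no pruning
    ranked = sorted(likelihoods, reverse=True)
    if n_t < 2 * n2:
        return n_best_beam(ranked, n2)   # too few paths: N-best with N2
    if n_t <= 2 * n3:
        return dynamic_beam              # normal range: dynamic beam b_t * B
    return n_best_beam(ranked, n3)       # too many paths: N-best with N3


def prune(likelihoods, beam: float):
    """Discard paths whose likelihood is below the best minus the beam."""
    if not likelihoods:
        return []
    best = max(likelihoods)
    return [p for p in likelihoods if p >= best - beam]
```

For example, with hypothetical thresholds N₁=1, N₂=2 and N₃=3, seven active paths exceed 2N₃, so the beam is set to keep the three best paths.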

FIG. 3 illustrates an embodiment of the invention having process 300, which adjusts the beam width during decoding only if there exists a normal number of active paths (not too many or too few) and uses N-best decoding when there are too few or too many active paths. In an N-best decoding method, for every time frame, only the N best hypotheses continue, while other hypotheses are pruned. Therefore, the list of N-best hypotheses can be re-scored. Process 300 begins with block 310, which determines the initial beam width.

The initial beam width is determined by performing speech recognition on a sample of the speech input. Process 300 continues with block 320, which determines boundary coefficients. After initially testing a speech recognition system, an optimal parameter set is selected based on optimal decoding speed and WER. The decoding speed and WER are determined based on the number of active hypotheses. Process 300 continues with block 330, which determines statistics on a sample of the input speech. Once block 330 has initially run a sample of the speech input and determined statistics on this input, block 340 then sets the thresholds for active hypotheses. Note that the thresholds are set according to equation 2. In one embodiment of the invention, the threshold 2N₃ is such a number of hypotheses that this threshold is exceeded for only approximately 10% of the time frames. In one embodiment of the invention, the threshold 2N₂ is such a number of hypotheses that for approximately 10% of the time frames there are fewer than 2N₂ hypotheses. One should note that other thresholds may be used in other embodiments of the invention.

In one embodiment of the invention, the threshold N₁ is set equal to N₂/5. Note that threshold N₁ is used for critical cases when there are very few hypotheses. In one embodiment of the invention, blocks 310-340 can be separated from process 300 and can be performed during speech recognition building and testing or can be performed while the speech recognition system is adapted (implicitly or explicitly) to a speaker and/or the environment.
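Deriving the thresholds from sample statistics may be sketched in Python as follows. The function name and the sample data are hypothetical, and the sketch assumes a simple rank-based cut on the sorted per-frame counts of active hypotheses, with the lowest threshold taken as N₂/5; it does not check that the resulting values satisfy 0<N₁<N₂<N₃, which small samples may violate.

```python
def thresholds_from_counts(counts, tail_percent: int = 10):
    """Return (N1, N2, N3) from per-frame counts of active hypotheses.

    2*N2 is placed so that about tail_percent of frames fall below it,
    2*N3 so that about tail_percent of frames exceed it.
    """
    ordered = sorted(counts)
    n = len(ordered)
    k = n * tail_percent // 100      # number of frames in each tail
    low_cut = ordered[k]             # about k frames lie below this count
    high_cut = ordered[n - 1 - k]    # about k frames lie above this count
    n2 = low_cut // 2
    n3 = high_cut // 2
    n1 = n2 // 5                     # lowest threshold taken as N2 / 5
    return n1, n2, n3


# Hypothetical sample: 100 frames with 100, 110, ..., 1090 active hypotheses.
sample_counts = list(range(100, 1100, 10))
print(thresholds_from_counts(sample_counts))
```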

Process 300 continues with block 350 where the next observation (or utterance) is retrieved. Process 300 continues with block 355. Block 355 determines whether the number of active hypotheses is greater than 2N₃. If block 355 determines that the number of active hypotheses is greater than 2N₃, then process 300 continues with block 356. Block 356 performs an N-best decoding with N set to N₃. For this case, it is sufficient to keep the same WER.

By setting N=N₃, decoding is accelerated. If block 355 determines that the number of active hypotheses is not greater than 2N₃, then process 300 continues with block 365.

Block 365 determines whether the number of active hypotheses is greater than or equal to 2N₂. If block 365 determines that the number of active hypotheses is greater than or equal to 2N₂, then process 300 continues with block 366. Block 366 determines the best likelihood α_(t) and the dynamic beam width β_(t)=b_(t)×B, as presented in process 200. Process 300 then continues with block 367, where all paths with a likelihood better than α_(t)−β_(t) are propagated. Process 300 then continues with block 380.

Block 380 determines whether the observation at time t=t+1 is the last observation. If block 380 determines that the observation is not the last observation, then process 300 continues with block 350. If block 380 determines that the observation at time t=t+1 is the last observation, process 300 continues with block 385. Block 385 then uses the best path as the result of the speech recognition process.

If block 365 determines that the number of active hypotheses is not greater than or equal to 2N₂, then process 300 continues with block 375.

Block 375 determines whether the number of active hypotheses is greater than or equal to 2N₁. If block 375 determines that the number of active hypotheses is greater than or equal to 2N₁, then process 300 continues with block 376. Block 376 decodes the active hypotheses with N-best decoding with N set equal to N₂. Process 300 then continues with block 380. If block 375 determines that the number of active hypotheses is not greater than or equal to 2N₁, then process 300 continues with block 390. Block 390 then propagates all hypotheses.

Table III compares results from statistical speech recognition using a Viterbi search on a Chinese language recognition task for three configurations: a static beam; an embodiment of the invention having a dynamic beam control process (process 200); and an embodiment of the invention using dynamic beam control in which the beam width is adjusted only if there exists a normal number of active paths (not too many or too few) and N-best decoding is used when there are too few or too many active paths (process 300). One should note that other languages may also be used in recognition tasks with embodiments of the invention.

In Table III, the results in the third column (modified dynamic beam width process) are for an embodiment of the invention that combines process 200 and process 300: the first pass used only dynamic beam control (process 200), and the second pass used dynamic beam control in which the beam width is adjusted only if there exist “enough” active paths, with N-best decoding used when there are too few or too many active paths (process 300).

In Table III, the embodiment of the invention having the modified dynamic beam process (third column) improved WER from 8.3 to 7.8 compared with the embodiment of the invention having only dynamic beam control (process 200). The results of the example illustrated in Table III are for an embodiment of the invention using the same parameters as the embodiment of the invention whose results are illustrated in Table I, with the additional parameters N₁=160, N₂=800 and N₃=5000 for the embodiment that also uses process 300.

TABLE III
Dynamic Beam Control for Chinese Task

                   Static Beam   Dynamic Beam   Modified Dynamic Beam
Real-time Rate     3.14          1.24           1.32
Word Error Rate    8.5           8.3            7.8

Table IV illustrates a comparison of results for an English language task of a Viterbi search using a static beam, an embodiment of the invention having dynamic beam control, and an embodiment of the invention having a modified dynamic beam control process in which the beam width is adjusted only if there exists a normal number of active paths (not too many or too few) and N-best decoding is used when there are too few or too many active paths. One should note that other language recognition tasks can be used with embodiments of the invention.

The results illustrated in Table IV for an embodiment of the invention use the same parameters as the embodiment of the invention whose example results are illustrated in Table II, with the additional parameters N₁=300, N₂=1500 and N₃=6000 for the embodiment of the invention that also uses process 300. The results in Table IV illustrate a WER improvement from 11.4 to 11 when the embodiment of the invention using dynamic beam control for the first pass and process 300 for the second pass is compared with the embodiment having only dynamic beam control. The results illustrated in Tables III and IV were measured using a 550 MHz Intel Pentium™ processor machine with 512 K cache and 512 MB SDRAM. One should note that other processor speeds and memory configurations may also be used with embodiments of the invention.

TABLE IV
Dynamic Beam Control for English Task

                   Static Beam   Dynamic Beam   Modified Dynamic Beam
Real-time Rate     3.4           1.85           1.95
Word Error Rate    11            11.4           11

It should be noted that the above discussed embodiments of the invention can be applied to tasks where the number of hypotheses is too large to be evaluated in a reasonable amount of time, and where beam width pruning should normally be applied.

FIG. 4 illustrates a typical system 400 that may be used for speech recognition applications. System 400 comprises memory 410, central processing unit (CPU) plus local cache 420, north bridge 430, south bridge 435, audio out 440, and audio in 450. Audio out device 440 may be a device such as a speaker system. Audio in device 450 may be a device such as a microphone.

FIG. 5 illustrates system 500 having an embodiment of the invention including dynamic beam width process 510. In one embodiment of the invention, dynamic beam width process 510 is implemented as process 200 (illustrated in FIG. 2). In another embodiment of the invention, dynamic beam width process 510 is implemented as process 300 (illustrated in FIG. 3). In another embodiment of the invention, dynamic beam width process 510 is implemented using both process 200 and process 300, where process 200 is used for the first pass, and process 300 is used for other passes. Process 510 may be implemented as an application program in memory 410. Memory 410 may be a memory device such as random access memory (RAM), dynamic RAM, or synchronous DRAM (SDRAM). It should be noted that other memory devices may also be used, including future developments in memory devices. It should also be noted that dynamic beam width process 510 may also be implemented on other readable media, such as a floppy disc, compact disc read-only memory (CD-ROM), etc.

FIG. 6 illustrates an embodiment of the invention having process 610 (illustrated in FIG. 5 as process 510) implemented in hardware. In one embodiment of the invention, process 610 is implemented using programmable logic arrays (PLAs). It should be noted that process 510 can be implemented using other electronic devices, such as registers and transistors. In another embodiment of the invention, process 610 is implemented in firmware.

By using embodiments of the invention during speech recognition processing using a Viterbi search method, processing speed is increased without increasing WER. Therefore, less time is necessary to complete speech recognition tasks.

The above embodiments can also be stored on a device or machine-readable medium and be read by a machine to perform instructions. The machine-readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). The device or machine-readable medium may include a solid state memory device and/or a rotating magnetic or optical disk. The device or machine-readable medium may be distributed when partitions of instructions have been separated into different machines, such as across an interconnection of computers.

While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art.

1. A method comprising: selecting an initial beam-width; determining whether a value per frame is changing; dynamically adjusting the beam width; and decoding a speech input with the dynamically adjusted beam width.
2. The method of claim 1, said decoding including pruning a set of active paths with the dynamically adjusted beamwidth.
3. The method of claim 1, said decoding further including using a hidden Markov model (HMM).
4. The method of claim 3, wherein the HMM is a Viterbi scoring search.
5. The method of claim 1, said dynamically adjusting including determining a first boundary coefficient and a second boundary coefficient based on heuristics.
6. The method of claim 1, wherein the value per frame is changing includes one of a likelihood of a best hypotheses per frame is increasing and the likelihood of the best hypotheses per frame is decreasing.
7. A method comprising: determining an initial beam width; determining a plurality of threshold values for a plurality of active hypotheses; determining a current number of active hypotheses; comparing the current number of active hypotheses with the plurality of threshold values; dynamically adjusting the beam width for a range of active hypotheses; and decoding a speech input.
8. The method of claim 7, said decoding including: pruning the plurality of active hypotheses with one of the dynamically adjusted beamwidth and an N-best pruning based on the plurality of threshold values.
9. The method of claim 8, said decoding further including using a hidden Markov model (HMM).
10. The method of claim 9, wherein the HMM is a Viterbi scoring search.
11. The method of claim 7, said dynamically adjusting including determining a first boundary coefficient and a second boundary coefficient based on heuristics.
12. The method of claim 7, said determining the plurality of threshold values comprises determining statistics on a sample of the speech input.
13. A system comprising: a processor; a bus coupled to the processor; a memory coupled to the processor, the memory having a speech recognition process; an input device coupled to the processor; and an output device coupled to the processor, wherein the speech recognition process dynamically adjusts a beam width to decode a speech input.
14. The system of claim 13, the speech recognition process decodes the speech input by pruning a plurality of active hypotheses with one of the dynamically adjusted beamwidth and an N-best pruning based on a plurality of threshold values stored in the memory.
15. The system of claim 14, wherein the speech recognition process further decodes the speech input using a hidden Markov model (HMM).
16. The system of claim 15, wherein the HMM is a Viterbi scoring search.
17. An apparatus comprising: a processor; a first circuit to perform a speech recognition process coupled to the processor; and a memory coupled to the processor; wherein the speech recognition process dynamically adjusts a beam width to decode a speech input.
18. The apparatus of claim 17, wherein the speech recognition process decodes the speech input by pruning a plurality of active hypotheses with one of the dynamically adjusted beamwidth and an N-best pruning based on a plurality of threshold values stored in the memory.
19. The apparatus of claim 18, wherein the speech recognition process further decodes the speech input using a hidden Markov model (HMM).
20. The apparatus of claim 19, wherein the HMM is a Viterbi scoring search.
21. The apparatus of claim 17, wherein the first circuit is a programmable logic array.
22. An apparatus comprising a machine-readable medium containing instructions which, when executed by a machine, cause the machine to perform operations comprising: selecting an initial beam width; determining whether a value per frame is changing; adjusting the beam width dynamically; and decoding a speech input with the dynamically adjusted beam width.
23. The apparatus of claim 22, said decoding including pruning a set of active paths with the dynamically adjusted beamwidth.
24. The apparatus of claim 23, said decoding further including using a hidden Markov model (HMM).
25. The apparatus of claim 24, wherein the HMM is a Viterbi scoring search.
26. The apparatus of claim 24, said adjusting further containing instructions which, when executed by a machine, cause the machine to perform operations including: determining a first boundary coefficient and a second boundary coefficient based on heuristics.
27. The apparatus of claim 22, wherein the value per frame is changing includes one of a likelihood of a best hypotheses per frame is increasing and the likelihood of the best hypotheses per frame is decreasing.
28. An apparatus comprising a machine-readable medium containing instructions which, when executed by a machine, cause the machine to perform operations comprising: determining an initial beam width; determining a plurality of threshold values for a plurality of active hypotheses; determining a current number of active hypotheses; comparing the current number of active hypotheses with the plurality of threshold values; adjusting the beam width dynamically for a range of active hypotheses; and decoding a speech input.
29. The apparatus of claim 28, said decoding further containing instructions which, when executed by a machine, cause the machine to perform operations including pruning the plurality of active hypotheses with one of the dynamically adjusted beamwidth and an N-best pruning based on the plurality of threshold values.
30. The apparatus of claim 29, said decoding further including using a Viterbi scoring search.